# Project 1: Regression

This project asks you to run a series of regression experiments. The data comes from a real estate dataset hosted on Kaggle:

https://www.kaggle.com/datasets/mirbektoktogaraev/madrid-real-estate-market

The objective of this project is to become familiar with the underlying techniques of machine learning, and implement some of the techniques yourself. 

You will write code and discussion texts into code and text cells in this notebook. 

If a block starts with TODO:, this means that you need to write something there. 

Some code has already been written for you to guide the project. Don't change it.

## Grading
The points add up to 42, that is 30 + 12 bonus points. While there is no difference between the regular and the bonus points, I recommend that you solve the problems labeled "BONUS" after you have finished the other ones. 


In [None]:
import pandas as pd
import numpy as np
import scipy as sp
import sklearn as sk

## Setup for the first part of the project

For problems P1 to P6 we are using a simple dataset where we extract one 
explanatory variable ``sq_mt_built`` to predict the price of the house ``buy_price``

In [None]:
df = pd.read_csv("houses_Madrid.csv")
print(f"The length {len(df.index)}")
print(f"The columns of the database {df.columns}")
df[["sq_mt_built", "buy_price"]].plot.scatter(x="sq_mt_built", y="buy_price")
## FIXME: add here the creation of the training data and test data

df_shuffled = df.sample(frac=1) # shuffle the rows

In [None]:
x = df_shuffled["sq_mt_built"].to_numpy(dtype=np.float64)
y = df_shuffled["buy_price"].to_numpy(dtype=np.float64)
training_data_x = x[:16000]
training_data_y = y[:16000]
test_data_x = x[16000:]
test_data_y = y[16000:]

In [None]:
training_data_y

## P1: Loss function (3 pts)
Implement a root-mean-square error (RMSE) loss function between the prediction $\hat{y}$ and the true value $y$ using Python operations. Run some experiments to validate that it works as expected. 
Then look up the corresponding metric in the sklearn library
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
and implement ``loss_RMSE_sk`` based on what you find there. 
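
As a hint for the hand-written part: RMSE is the square root of the mean of the squared residuals. A minimal sketch on toy arrays, assuming NumPy-compatible inputs; the name `rmse_sketch` is illustrative and is not part of the provided code:

```python
import numpy as np

def rmse_sketch(y, yhat):
    """Root-mean-square error between targets y and predictions yhat."""
    y = np.asarray(y, dtype=np.float64)
    yhat = np.asarray(yhat, dtype=np.float64)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

print(rmse_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect prediction gives 0.0
print(rmse_sketch([0.0, 0.0], [3.0, 4.0]))            # residuals 3 and 4: sqrt((9 + 16) / 2)
```

Hand-checkable cases like these are a reasonable way to validate your own implementation before comparing it with the sklearn-based one.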

In [None]:
# TODO: implement the loss function here using Python math ops and sklearn
def loss_RMSE(y, yhat):
    return 0

def loss_RMSE_sk(y, yhat):
    return 0

In [None]:
# TODO: Now, run some experiments with your function and with the sklearn-based one.
# Compare their outputs.

## P2: Implement a linear predictor (3 pts)
Implement a function of type ``predict(x, theta) --> y_hat`` which implements a linear model of the type $\hat{y} = \theta_1 \cdot x + \theta_0$
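
To make the indexing convention concrete: $\theta_0$ is the intercept and $\theta_1$ is the slope. A small sketch, assuming `theta` is a two-element sequence ordered `[theta_0, theta_1]`; the function name and the toy numbers are hypothetical:

```python
import numpy as np

def predict_sketch(x, theta):
    """Linear model y_hat = theta[1] * x + theta[0]; works on scalars and arrays."""
    theta0, theta1 = theta
    return theta1 * np.asarray(x, dtype=np.float64) + theta0

x_toy = np.array([50.0, 100.0, 150.0])   # hypothetical built areas in square meters
print(predict_sketch(x_toy, [10000.0, 3000.0]))
```

Because the expression is vectorized, the same function can score the whole training array in one call, which matters once a search procedure evaluates many candidate values of $\theta$.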

In [None]:
# TODO: implement the predictor function here
def predict(x, theta):
    y_hat = 0
    return y_hat

In [None]:
# TODO: now, run some experiments with it

## P3: Implement a "grid search" function (3 pts)
Implement a function ``grid_search()`` that returns an estimate of the best $\theta$ by trying every combination of candidate values on a grid and returning the combination that gives the lowest loss on the training data. 
``grid0`` and ``grid1`` define the candidate values to explore for $\theta_0$ and $\theta_1$, respectively. For instance, ``grid0`` might be [0, 0.25, 0.5, 0.75, 1.0]. 
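
The idea can be sketched on synthetic data where the true parameters are known. This is an illustration under assumptions, not the required solution: the helper name is made up, it returns the loss together with $\theta$ (the notebook's stub returns only $\theta$), and the grids are chosen so that the true optimum lies on a grid point.

```python
import itertools
import time
import numpy as np

def grid_search_sketch(xs, ys, grid0, grid1):
    """Return the [theta0, theta1] pair from the grid with the lowest RMSE."""
    best_theta, best_loss = None, np.inf
    for t0, t1 in itertools.product(grid0, grid1):
        loss = np.sqrt(np.mean((ys - (t1 * xs + t0)) ** 2))
        if loss < best_loss:
            best_theta, best_loss = [t0, t1], loss
    return best_theta, best_loss

# Synthetic data generated from y = 2x + 1.
xs = np.linspace(0.0, 10.0, 50)
ys = 2.0 * xs + 1.0
start = time.perf_counter()
theta, loss = grid_search_sketch(xs, ys, np.linspace(0, 2, 5), np.linspace(0, 4, 9))
print(theta, loss, f"{time.perf_counter() - start:.4f} s")
```

The number of loss evaluations is `len(grid0) * len(grid1)`, so doubling the resolution of both grids roughly quadruples the running time.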

In [None]:
# TODO: implement the grid search function here 
def grid_search(training_data_x, training_data_y, grid0, grid1):
    theta = [0, 0]
    return theta

In [None]:
# TODO: run some experiments with grid_search
# Define some grid values. Train it on the data set. Test it on the test set. 
# Print the loss on the data set and the test set. Measure and print how long the training takes.

In [None]:
# TODO: repeat the experimentation from above with different grids. 
# Finally, print the grid that provides the best value while still running faster 
# than 10 seconds.

## P4: Implement a random search function (3 pts)
Implement a function that returns the estimate for the best $\theta$ by trying out random 
$\theta=[\theta_0, \theta_1]$ values, and returning the one that minimizes the error on the training set passed to it. The number of tries is described in the ``trials`` parameter.
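
A minimal sketch of the technique on synthetic data, assuming uniform sampling inside a box. The bounds `low` and `high`, the `seed` argument, and the function name are assumptions added for illustration; the function you are asked to write takes only the training data and ``trials``.

```python
import numpy as np

def random_search_sketch(xs, ys, trials, low, high, seed=0):
    """Draw `trials` uniform theta candidates in [low, high] and keep the best."""
    rng = np.random.default_rng(seed)
    best_theta, best_loss = None, np.inf
    for _ in range(trials):
        theta = rng.uniform(low, high)   # one sample per parameter
        loss = np.sqrt(np.mean((ys - (theta[1] * xs + theta[0])) ** 2))
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta, best_loss

xs = np.linspace(0.0, 10.0, 50)
ys = 2.0 * xs + 1.0                      # synthetic data, true theta = [1, 2]
low, high = np.array([0.0, 0.0]), np.array([2.0, 4.0])
losses = []
for trials in (10, 100, 1000):
    theta, loss = random_search_sketch(xs, ys, trials, low, high)
    losses.append(loss)
    print(trials, theta, loss)
```

With a fixed seed, a longer run sees the same candidates as a shorter one plus additional ones, so the best loss cannot get worse as ``trials`` grows.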

In [None]:
# TODO: implement the random search function here 
def random_search(training_data_x, training_data_y, trials):
    theta = [0, 0]
    return theta

In [None]:
# TODO: run some experiments with random_search
# Choose some value for trials. Train it on the data set. Test it on the test set. 
# Print the loss on the data set and the test set. Measure and print how long the training takes.

## P5: Bonus: Improvements (3 pts)
Propose an improvement to the algorithms you have implemented for P3 and P4 and show that your improvements perform better than the original. Some examples of what you might try:
* Choose values for $\theta_0$ and $\theta_1$ on a non-uniform grid
* First find one of them, and fix it, and then refine on the other one
* For random: sample according to a non-uniform distribution
* First use a low resolution search to find the approximate values of  $\theta_0$ and $\theta_1$, then search for a more precise value
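
The last suggestion (low resolution first, then refine) could look roughly like the sketch below. It is one possible reading, not a prescribed method: the helper names, the number of grid points, and the rule for shrinking the search box are all assumptions.

```python
import itertools
import numpy as np

def rmse(xs, ys, t0, t1):
    return np.sqrt(np.mean((ys - (t1 * xs + t0)) ** 2))

def coarse_to_fine_sketch(xs, ys, lo, hi, points=7, rounds=4):
    """Repeatedly grid-search a box, then shrink the box around the best point."""
    lo, hi = np.array(lo, dtype=float), np.array(hi, dtype=float)
    best = None
    for _ in range(rounds):
        g0 = np.linspace(lo[0], hi[0], points)
        g1 = np.linspace(lo[1], hi[1], points)
        best = min(itertools.product(g0, g1), key=lambda t: rmse(xs, ys, *t))
        half = (hi - lo) / (points - 1)          # one grid step on each side
        lo, hi = np.array(best) - half, np.array(best) + half
    return list(best), rmse(xs, ys, *best)

xs = np.linspace(0.0, 10.0, 50)
ys = 2.3 * xs + 0.7                              # synthetic data, optimum off the coarse grid
coarse_theta, coarse_loss = coarse_to_fine_sketch(xs, ys, [0, 0], [3, 6], rounds=1)
fine_theta, fine_loss = coarse_to_fine_sketch(xs, ys, [0, 0], [3, 6], rounds=4)
print(coarse_theta, coarse_loss)
print(fine_theta, fine_loss)
```

Four rounds of a 7 x 7 grid cost 196 loss evaluations. Reaching a comparable step size with a single uniform grid would need far more points, which is the kind of comparison this problem asks you to demonstrate.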

In [None]:
# TODO: implement your improvements here

TODO: Describe in one paragraph the conclusions you have drawn from your improvement experiments

## P6: Using the sklearn library (3 pts)

Use ``sklearn.linear_model.LinearRegression`` to solve the same problem you previously solved using the grid search and random search.

Compare the loss it achieves with the loss you achieved using your own search functions. Compare the parameters sklearn found with the parameters you found. Compare the speed. 
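
For orientation, here is a sketch of the sklearn calls on synthetic data (the numbers are made up and are not taken from the Madrid dataset). One detail that is easy to miss: ``fit`` and ``predict`` expect a 2-D feature matrix, so a single explanatory variable has to be reshaped into a column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
xs = rng.uniform(20.0, 300.0, size=200)                  # synthetic areas
ys = 3000.0 * xs + 10000.0 + rng.normal(0.0, 5000.0, size=200)

model = LinearRegression()
model.fit(xs.reshape(-1, 1), ys)              # shape (n_samples, 1)
theta = [model.intercept_, model.coef_[0]]    # ordered as [theta_0, theta_1]
y_hat = model.predict(xs.reshape(-1, 1))
print(theta, np.sqrt(np.mean((ys - y_hat) ** 2)))
```

The fitted intercept and slope are read from ``intercept_`` and ``coef_``, which makes them directly comparable with the $\theta$ returned by a search procedure.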

In [None]:
# TODO: Implement here

In [None]:
# TODO: Run performance experiments here. 

TODO: discuss the performance of the sklearn library implementation, compared to what you implemented.

# Setup for the second part of the project
For the questions P7-P10 we use linear regression in a multivariate setting. This time, there are 7 explanatory variables: ``sq_mt_built``, ``n_rooms``, ``n_bathrooms``, ``has_individual_heating``, ``is_renewal_needed``, ``is_new_development`` and ``has_fitted_wardrobes``. 

We will first create the training and test data while doing some minimal data cleaning.

In [None]:
# Replace the NA values with some sensible defaults.
# The defaults were chosen by inspecting, for example,
#    df["has_individual_heating"].value_counts(dropna=False)
df_shuffled["has_individual_heating"] = df_shuffled["has_individual_heating"].fillna(False)
df_shuffled["n_bathrooms"] = df_shuffled["n_bathrooms"].fillna(1)
df_shuffled["is_new_development"] = df_shuffled["is_new_development"].fillna(False)
df_shuffled["has_fitted_wardrobes"] = df_shuffled["has_fitted_wardrobes"].fillna(False)

xfields = ["sq_mt_built", "n_rooms", "n_bathrooms", "has_individual_heating", \
           "is_renewal_needed", "is_new_development", "has_fitted_wardrobes"]

x = df_shuffled[xfields].to_numpy(dtype=np.float64)
y = df_shuffled["buy_price"].to_numpy(dtype=np.float64)
training_data_x = x[:16000]
training_data_y = y[:16000]
test_data_x = x[16000:]
test_data_y = y[16000:]

## P7: Implement grid search for multiple variables (3 pts)
Implement the linear predictor model for multiple variables. Note that this time ``x`` will be an array of 7 values, and ``theta`` will be an array of 8 values (one weight per explanatory variable plus the intercept). 

Then, implement a grid search function (similar to P3) but this time for the 7 explanatory variables. Pass the grids as an array in the ``grids`` parameter. 
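
Before running a full grid search over 8 parameters, it helps to count the combinations. The sketch below assumes ``theta[0]`` is the intercept and uses a hypothetical grid of 5 candidates per parameter; the function name is illustrative.

```python
import itertools
import numpy as np

def predict_multi_sketch(x, theta):
    """y_hat = theta[0] + x @ theta[1:], for x of shape (n, 7) or (7,)."""
    theta = np.asarray(theta, dtype=np.float64)
    return np.asarray(x, dtype=np.float64) @ theta[1:] + theta[0]

grids = [np.linspace(0.0, 1.0, 5)] * 8      # hypothetical: 5 candidates per parameter
n_combinations = int(np.prod([len(g) for g in grids]))
print(n_combinations)                        # 5 ** 8 = 390625

# itertools.product yields candidates lazily; show the first three
# without materializing the full list.
for theta in itertools.islice(itertools.product(*grids), 3):
    print(theta)
```

Each extra candidate value per parameter multiplies the cost across all 8 dimensions, so the grid resolution that was affordable for two parameters is usually out of reach here.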

In [None]:
# TODO: implement the predictor function here
def predict_multi(x, theta):
    y_hat = 0
    return y_hat

# TODO: implement the grid search function here 
def grid_search_multi(training_data_x, training_data_y, grids):
    theta = [0] * 8
    return theta

In [None]:
# TODO: run experiments with your implementation for the grid search 

TODO: describe here your experiences with implementing this problem and the conclusions you draw. 

## P8: Random search for multiple variables (3 pts)
Implement the random search technique for the multiple variables. 

In [None]:
# TODO: implement the random search function here 
def random_search_multi(training_data_x, training_data_y, trials):
    theta = [0] * 8
    return theta

In [None]:
# TODO: run experiments with your implementation for the random search

TODO: describe here your experiments with the implementation for the random search

## P9: Use sklearn for linear regression in multiple variables (3 pts)
Use ``sklearn.linear_model.LinearRegression`` to solve the same problem you previously solved using the grid search and random search.

Compare the loss it achieves with the loss you achieved using your own search functions. Compare the parameters sklearn found with the parameters you found. Compare the speed.

In [None]:
# TODO: implement here

In [None]:
# TODO: run experiments here. 

TODO: describe in one paragraph your experiences with implementing the multiple variable linear regression. 

## P10: Bonus: Data wrangling (3 pts)

Perform data preprocessing / cleaning / wrangling on the multiple variable dataset. This might include changing the range, removing outliers, etc. The objective is to obtain a better performance by a regressor as measured on the test data. Document your experiments with plots etc.
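
Two common transformations, sketched on synthetic data: dropping rows whose target falls outside a percentile band, and standardizing each feature column. The percentile thresholds, helper names, and toy distributions are assumptions for illustration. The scaling statistics are computed on training rows only so that no information from the test set leaks into the model.

```python
import numpy as np

def fit_scaler_sketch(train_x):
    """Column-wise mean and standard deviation, computed on training rows only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0          # avoid dividing by zero for constant columns
    return mean, std

def outlier_mask_sketch(y, lower_pct=1.0, upper_pct=99.0):
    """True for rows whose target lies inside the given percentile band."""
    lo, hi = np.percentile(y, [lower_pct, upper_pct])
    return (y >= lo) & (y <= hi)

rng = np.random.default_rng(0)
train_x = rng.normal(size=(500, 3)) * [100.0, 2.0, 1.0] + [150.0, 3.0, 1.0]
train_y = rng.lognormal(mean=12.0, sigma=1.0, size=500)   # heavy right tail

keep = outlier_mask_sketch(train_y)
mean, std = fit_scaler_sketch(train_x[keep])
scaled = (train_x[keep] - mean) / std
print(keep.sum(), scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

When you evaluate on the test data, apply the same ``mean`` and ``std`` to the test features rather than recomputing them.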

In [None]:
# TODO: insert the code you use to investigate the properties of the data here

In [None]:
# TODO: implement the data transformations here

In [None]:
# TODO: run experiments with the transformed data here. Measure the performance

TODO: describe in one paragraph your experiences with the data wrangling process.

## P11: Bonus: explore other linear regression techniques (3 pts)

Explore the use of other models provided from the sklearn library for linear regression. 

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model

Try out two of them of your choice. Explain the results you obtained and compare them with other approaches. 

# K-nearest neighbors

## P12: K-nearest neighbors for the single variable case (3 pts)
Implement the k-nearest neighbor algorithm for the single variable case. Given an x value, find the k closest values from the training data, and return their average. 
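
A toy illustration of the averaging step, assuming the training data is passed in explicitly (the stub in this notebook takes only ``x`` and ``k``); the function name and the numbers are made up:

```python
import numpy as np

def knn_predict_sketch(x, k, train_x, train_y):
    """Average the targets of the k training points closest to the scalar x."""
    distances = np.abs(train_x - x)
    nearest = np.argpartition(distances, k - 1)[:k]   # indices of the k smallest
    return float(train_y[nearest].mean())

train_x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
train_y = np.array([10.0, 20.0, 30.0, 100.0, 110.0])
print(knn_predict_sketch(2.1, 1, train_x, train_y))   # nearest neighbor is x = 2
print(knn_predict_sketch(2.1, 3, train_x, train_y))   # neighbors are x = 1, 2, 3
```

``np.argpartition`` avoids a full sort, but every prediction still scans all training points, which is worth keeping in mind when you measure speed.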

In [None]:
# TODO: implement here 
def predict_k_nearest(x, k):
    y_hat = 0
    return y_hat

In [None]:
# TODO: experiment here in terms of accuracy and speed. Experiment with multiple values of k

TODO: Write a paragraph about the results of the experiments. How does it compare 
to the other techniques you implemented above?

## P13: K-nearest neighbors for the multiple variable case (3 pts)
Implement the k-nearest neighbor algorithm for the multiple variable case. 

In [None]:
# TODO: implement here 
def predict_k_nearest_multiple(x, k):
    y_hat = 0
    return y_hat

In [None]:
# TODO: experiment here in terms of accuracy and speed. Experiment with multiple values of k. 

TODO: Write a paragraph about the results of the experiments. How does it compare 
to the other techniques you implemented above?

## P14: Bonus: Experiment with the sklearn implementation of K-nearest neighbors (3 pts)
Using the ``sklearn.neighbors.KNeighborsRegressor`` model, implement the multi-variable regression model. Run experiments with different values of the $k$ hyper-parameter. 
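
A sketch of such an experiment loop on synthetic 7-feature data (the data, the list of $k$ values, and the variable names are assumptions; substitute the training and test arrays from the setup cell):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
weights = np.arange(1.0, 8.0)
train_x = rng.uniform(0.0, 1.0, size=(300, 7))     # synthetic 7-feature data
train_y = train_x @ weights + rng.normal(0.0, 0.1, size=300)
test_x = rng.uniform(0.0, 1.0, size=(50, 7))
test_y = test_x @ weights

results = {}
for k in (1, 5, 25):
    model = KNeighborsRegressor(n_neighbors=k).fit(train_x, train_y)
    results[k] = np.sqrt(np.mean((test_y - model.predict(test_x)) ** 2))
    print(k, results[k])
```

With the default settings the regressor uses Euclidean distance and uniform weights, so its predictions should match a brute-force average over the $k$ closest training rows.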

In [None]:
# TODO: run experiments here

TODO: Write a paragraph about the results of the experiments. How does it compare to the techniques you implemented for k-nearest neighbor in terms of accuracy and speed? Provide an explanation for the results. 